

- Project Idea
- Preliminary Investigation
- Project Schedule
- Deliverables and Success Criteria

### Recap: Our Project Idea

Future mobile application?
<u>User-definable ML services!</u>

"Hey Siri! Learn 'kalguksu' (Korean knife-cut noodle soup) and categorize it!" "Yes, sir!"

<u>On-device training</u> will be essential!



Can we do on-device training quickly?

No, because of poor performance!



### Recap: Our Project Idea (Cont.)

We focus on ...

Data Movement Bottleneck!

Our approach is ...

Processing-in-Flash (PiF)!



## Questions We Need to Answer for *PiF*

- (1) Is our hypothesis (data movement will be the bottleneck!) valid?
- (2) Is it possible to combine an accelerator with the mobile flash chip?
- (3) If so, how do we derive the specification of a suitable accelerator?
- (4) Can *PiF* really do better than the baseline system?

In this midterm presentation, we address questions (2) and (3).




### Is it possible to combine an accelerator with the mobile flash chip?







#### Answer: "Yes, there is enough free space."





Free space: $\approx 100\ \mathrm{mm}^2$ (70%)

| Reference     | Die Size               |
|---------------|------------------------|
| Apple A12 GPU | < 15 mm<sup>2</sup>    |
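The area argument comes down to simple arithmetic. The sketch below assumes the ≈ 100 mm² figure is the free area available for an accelerator and treats the A12 GPU's "< 15 mm²" as an upper bound for a comparable design. The constant and function names are illustrative only, and routing and thermal overhead are ignored.

```python
# Area feasibility check using the numbers on this slide.
# Assumption: ~100 mm^2 is the free area; "< 15 mm^2" is used as an upper bound.
FREE_AREA_MM2 = 100.0
A12_GPU_DIE_MM2 = 15.0


def area_share(accel_mm2: float, free_mm2: float = FREE_AREA_MM2) -> float:
    """Fraction of the free area an accelerator of the given size would occupy."""
    if free_mm2 <= 0:
        raise ValueError("free area must be positive")
    return accel_mm2 / free_mm2


def fits(accel_mm2: float, free_mm2: float = FREE_AREA_MM2) -> bool:
    """True if the accelerator area does not exceed the free area."""
    return accel_mm2 <= free_mm2


if __name__ == "__main__":
    print(f"share of free area: {area_share(A12_GPU_DIE_MM2):.0%}")
    print(f"fits: {fits(A12_GPU_DIE_MM2)}")
```

Under these assumptions, a GPU-class accelerator would take at most about 15% of the free area.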

### How do we derive the specification of a suitable accelerator?


| Memory Spec    | BW       |
|----------------|----------|
| SRAM           | 15 GB/s  |
| NAND Interface | 1.2 GB/s |

| Flash Property | Value  |
|----------------|--------|
| Page Size      | 16 KB  |
| Cell Type      | SLC    |
| # of Planes    | 8      |
| # of Chips     | 4      |

Accelerator data path:

| Data                | Data Path           |
|---------------------|---------------------|
| Input               | Page Buf. Interface |
| Weight              | Page Buf. Interface |
| Intermediate Result | SRAM Interface      |
| Final Result        | SRAM Interface      |
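To see why the data path matters, the spec values can be turned into a rough transfer-time estimate. This is a back-of-the-envelope sketch, not the project's model. It assumes the bandwidths are per second, 1 GB = 10^9 bytes, 1 KB = 1024 bytes, the plane count is per chip, and all planes on all chips can be read in parallel. None of these are stated on the slide, and the names are illustrative.

```python
# Back-of-the-envelope data-movement estimate from the spec tables.
# Assumptions (not stated on the slide): bandwidths are per second,
# 1 GB = 1e9 bytes, 1 KB = 1024 bytes, planes are counted per chip,
# and all planes on all chips can be read in parallel.
PAGE_SIZE_BYTES = 16 * 1024
PLANES_PER_CHIP = 8
CHIPS = 4
NAND_IF_BW = 1.2e9  # bytes/s, NAND interface
SRAM_BW = 15e9      # bytes/s, SRAM


def parallel_read_bytes() -> int:
    """Bytes held in the page buffers after one fully parallel page read."""
    return PAGE_SIZE_BYTES * PLANES_PER_CHIP * CHIPS


def transfer_time_us(num_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time in microseconds to move num_bytes at the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s * 1e6


if __name__ == "__main__":
    n = parallel_read_bytes()
    print(f"bytes per parallel read: {n}")
    print(f"over NAND interface: {transfer_time_us(n, NAND_IF_BW):.1f} us")
    print(f"at SRAM bandwidth:   {transfer_time_us(n, SRAM_BW):.1f} us")
```

Under these assumptions, one parallel read fills 512 KiB of page buffers, and moving that data off-chip over the NAND interface takes about 12.5 times longer than a transfer at SRAM bandwidth. That is consistent with feeding inputs and weights to the accelerator through the page buffer interface rather than over the NAND interface.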




## **Overall Project Plan**

- (1) Is our hypothesis (data movement will be the bottleneck!) valid?
  - Android Reference Board (HiKey 960, etc.) or Pixel Phone
- (2) Is it possible to combine an accelerator with the mobile flash chip?
- (3) How do we derive the specification of a suitable accelerator?
- (4) Can *PiF* really do better than the baseline system?
  - Micro Benchmark w/ Flash Chip Simulator (CoX-Sim)
  - Macro Benchmark w/ Whole System Simulator (T.B.D.)

In the final presentation, we will answer all four questions, focusing especially on (4).
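For question (1), one simple way to frame the check is to compare the time spent moving data against the time spent computing. The sketch below is a minimal model of that comparison, not the project's benchmark. The workload numbers in the usage example are hypothetical placeholders that would be replaced by measurements from the reference board, and the function name is illustrative.

```python
# Minimal bottleneck check: compare data-movement time with compute time.
# All workload numbers used with this function are hypothetical placeholders.
def bottleneck(bytes_moved: float, flops: float,
               bandwidth_bytes_per_s: float, compute_flops_per_s: float):
    """Return (label, t_move, t_compute) with times in seconds."""
    t_move = bytes_moved / bandwidth_bytes_per_s
    t_compute = flops / compute_flops_per_s
    label = "data movement" if t_move > t_compute else "compute"
    return label, t_move, t_compute


if __name__ == "__main__":
    # Hypothetical training step: 100 MB moved, 2 GFLOP of work,
    # 1.2 GB/s storage bandwidth, 100 GFLOP/s compute throughput.
    label, t_move, t_compute = bottleneck(100e6, 2e9, 1.2e9, 100e9)
    print(f"{label}: move={t_move * 1e3:.1f} ms, compute={t_compute * 1e3:.1f} ms")
```

The model ignores overlap between transfers and computation, so it only gives a first-order indication of which side dominates.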


### **Deliverables and Success Criteria**

| Deliverables                                                        | Success Criteria                                               |
|---------------------------------------------------------------------|----------------------------------------------------------------|
| On-device training benchmark results                                | Verify that data movement is the bottleneck                    |
| Quantitative investigation of mobile flash packages and flash chips | Verify whether an accelerator can be mounted on the flash chip |
| Flash chip simulator (a.k.a. CoX-Sim) ≈ 80%                         |                                                                |
| Whole-system simulator (or emulator)                                | Show that *PiF* performs better than the baseline system       |
| Macro and micro benchmark results                                   |                                                                |

